Abstract: We study a new notion of graph centrality based on
absorbing random walks. Given a graph G = (V, E) and a set of
query nodes Q ⊆ V, we aim to identify the k most central nodes
in G with respect to Q. Specifically, we consider central nodes to 
be absorbing for random walks that start at the query nodes Q. 
The goal is to find the set of k central nodes that minimizes the 
expected length of a random walk until absorption. The proposed 
measure, which we call k absorbing random-walk centrality, favors
diverse sets, as it is beneficial to place the k absorbing nodes in 
different parts of the graph so as to “intercept” random walks 
that start from different query nodes. 

Although similar problem definitions have been considered in 
the literature, e.g., in information-retrieval settings where the 
goal is to diversify web-search results, in this paper we study 
the problem formally and prove some of its properties. We show 
that the problem is NP-hard, while the objective function is 
monotone and supermodular, implying that a greedy algorithm 
provides solutions with an approximation guarantee. On the other 
hand, the greedy algorithm involves expensive matrix operations 
that make it prohibitive to employ on large datasets. To confront 
this challenge, we develop more efficient algorithms based on 
spectral clustering and on personalized PageRank. 
I. Introduction

A fundamental problem in graph mining is to identify the 
most central nodes in a graph. Numerous centrality measures 
have been proposed, including degree centrality, closeness 
centrality [14], betweenness centrality [5], random-walk
centrality [13], Katz centrality [9], and PageRank [4].

In the interest of robustness, many centrality measures use
random walks: while the shortest-path distance between two 
nodes can change dramatically by inserting or deleting a single 
edge, distances based on random walks account for multiple 
paths and offer a more global view of the connectivity between 
two nodes. In this spirit, the random-walk centrality of one 
node with respect to all nodes of the graph is defined as the 
expected time needed to come across this node in a random 
walk that starts in any other node of the graph [13]. 

In this paper, we consider a measure that generalizes 
random-walk centrality for a set of nodes C with respect to 
a set of query nodes Q. Our centrality measure is defined as 
the expected length of a random walk that starts from any 
node in Q until it reaches any node in C, at which point
the random walk is “absorbed” by C. Moreover, to allow for
adjustable importance of query nodes in the centrality measure,
we consider random walks with restarts, which occur with a fixed
probability α at each step of the random walk. The resulting
computational problem is to find a set of k nodes C that
optimizes this measure with respect to nodes Q, which are
provided as input. We call this measure k absorbing random-walk
centrality and the corresponding optimization problem
k-ARW-Centrality.

To motivate the k-ARW-Centrality problem, let us consider
the scenario of searching the Web graph and summarizing
the search results. In this scenario, nodes of the graph
correspond to webpages, edges between nodes correspond to 
links between pages, and the set of query nodes Q consists 
of all nodes that match a user query, i.e., all webpages that 
satisfy a keyword search. Assuming that the size of Q is large, 
the goal is to find the k most central nodes with respect to Q, 
and present those to the user. 

It is clear that ordering the nodes of the graph by their
individual random-walk centrality scores and taking the top-k
set does not solve the k-ARW-Centrality problem, as these
nodes may all be located in the same “neighborhood” of the
graph, and thus may not provide a good absorbing set for
the query. On the other hand, as the goal is to minimize the
expected absorption time for walks starting at Q, the optimal
solution to the k-ARW-Centrality problem will be a set of
k nodes that are both centrally placed and diverse.

This observation has motivated researchers in the
information-retrieval field to consider random walks with absorbing
states in order to diversify web-search results [18]. However,
despite the fact that similar problem definitions and algorithms
have been considered earlier, the k-ARW-Centrality problem
has not been formally studied and there has not been a
theoretical analysis of its properties.

Our key results in this paper are the following: we show
that the k-ARW-Centrality problem is NP-hard, and we
show that the k absorbing random-walk centrality measure
is monotone and supermodular. The latter property allows us
to quantify the approximation guarantee obtained by a natural
greedy algorithm, which has also been considered by previous
work [18]. Furthermore, a naive implementation of the greedy
algorithm requires many expensive matrix inversions, which
make the algorithm particularly slow. Part of our contribution
is to show how to make use of the Sherman-Morrison
inversion formula to implement the greedy algorithm with
only one matrix inversion and more efficient matrix-vector
multiplications.

Moreover, we explore the performance of faster, heuristic 
algorithms, aiming to identify methods that are faster than 
the greedy approach without significant loss in the quality of 
results. The heuristic algorithms we consider include the
personalized PageRank algorithm [4], [10] as well as algorithms
based on spectral clustering [17]. We find that, in practice, the 
personalized PageRank algorithm offers a very good trade-off 
between speed and quality. 

The rest of the paper is organized as follows. In Section II, 
we overview previous work and discuss how it compares 
to this paper. We define our problem in Section III and 
provide basic background results on absorbing random walks 
in Section IV. Our main technical contributions are given in
Sections V and VI, where we characterize the complexity of
the problem, and provide the details of the greedy algorithm 
and the heuristics we explore. We evaluate the performance of 
algorithms in Section VII, over a range of real-world graphs, 
and Section VIII is a short conclusion. Proofs for some of the 
theorems shown in the paper are provided in the Appendix. 

II. Related work 

Many works in the literature explore ways to quantify the 
notion of node centrality on graphs [3]. Some of the most 
commonly-used measures include the following: (i) degree 
centrality, where the centrality of a node is simply quantified 
by its degree; (ii) closeness centrality [11], [14], defined 
as the average distance of a node from all other nodes on 
the graph; (iii) betweenness centrality [5], defined as the
number of shortest paths between pairs of nodes in the graph
that pass through a given node; (iv) eigenvector centrality,
defined as the stationary probability that a Markov chain on
the graph visits a given node, with Katz centrality [9] and
PageRank [4] being two well-studied variants; and (v) random-walk
centrality [13], defined as the expected first passage time
of a random walk from a given node, when it starts from a 
random node of the graph. The measure we study in this paper 
generalizes the notion of random-walk centrality to a set of 
absorbing nodes. 

Absorbing random walks have been used in previous work 
to select a diverse set of nodes from a graph. For example, 
an algorithm proposed by Zhu et al. [18] selects nodes in the 
following manner: (i) the first node is selected based on its 
PageRank value and is set as absorbing; (ii) the next node to be
selected is the node that maximizes the expected first-passage 
time from the already selected absorbing nodes. Our problem 
definition differs considerably from the one considered in that 
work, as in our work the expected first-passage times are 
always computed from the set of query nodes that are provided 
in the input, and not from the nodes that participate in the 
solution so far. In this respect, the greedy method proposed 
by Zhu et al. is not associated with a crisp problem definition. 


Another conceptually related line of work aims to select a 
diverse subset of query results, mainly within the context of 
document retrieval [1], [2], [16]. The goal, there, is to select k 
query results to optimize a function that quantifies the trade-off 
between relevance and diversity. 

Our work is also remotely related to the problem studied 
by Leskovec et al. on cost-effective outbreak detection [12]. 
One of the problems discussed there is to select nodes in the 
network so that the detection time for a set of cascades is 
minimized. However, their work differs from ours on the fact 
that they consider as input a set of cascades, each one of finite 
size, while in our case the input consists of a set of query nodes 
and we consider a probabilistic model that generates random 
walk paths, of possibly infinite size. 

III. Problem definition 

We are given a graph G = (V, E) over a set of nodes V
and set of undirected edges E. The number of nodes |V| is
denoted by n and the number of edges |E| by m. The input
also includes a subset of nodes Q ⊆ V, to which we refer as
the query nodes. As a special case, the set of query nodes Q
may be equal to the whole set of nodes, i.e., Q = V.

Our goal is to find a set C of k nodes that are central with
respect to the query nodes Q. For some applications it makes 
sense to restrict the central nodes to be only among the query 
nodes, while in other cases, the central nodes may include any 
node in V. To model those different scenarios, we consider a 
set of candidate nodes D, and require that the k central nodes 
should belong in this candidate set, i.e., C ⊆ D. Some of the
cases include D = Q, D = V, or D = V \ Q, but it could also
be that D is defined in some other way that does not involve 
Q. In general, we assume that D is given as input. 

The centrality of a set of nodes C with respect to query 
nodes Q is based on the notion of absorbing random-walks 
and their expected length. More specifically, let us consider 
a random walk on the nodes V of the graph, that proceeds 
at discrete steps: the walk starts from a node q ∈ Q and, at
each step, moves to a different node, following edges in G,
until it arrives at some node in C. The starting node q of
the walk is chosen according to a probability distribution s.
When the walk arrives at a node c ∈ C for the first time,
it terminates, and we say that the random walk is absorbed 
by that node c. In the interest of generality, and to allow 
for adjustable importance of query nodes in the centrality 
measure, we also allow the random walk to restart. Restarts 
occur with a probability α at each step of the random walk,
where α is a parameter that is specified as input to the problem.
When restarting, the walk proceeds to a query node selected
randomly according to s. Intuitively, larger values of α favor
nodes that are closer to nodes Q.

We are interested in the expected length (i.e., number of
steps) of the walk that starts from a query node q ∈ Q until it
gets absorbed by some node in C, and we denote this expected
length by ac_Q^q(C). We then define the absorbing random-walk
centrality of a set of nodes C with respect to query nodes Q, by

$$\mathrm{ac}_Q(C) = \sum_{q \in Q} s(q)\, \mathrm{ac}_Q^q(C).$$

The problem we consider in this paper is the following. 

Problem 1 (k-ARW-Centrality) We are given a graph
G = (V, E), a set of query nodes Q ⊆ V, a set of candidate
nodes D ⊆ V, a starting probability distribution s over V
such that s(v) = 0 if v ∈ V \ Q, a restart probability α, and
an integer k. We ask to find a set of k nodes C ⊆ D that
minimizes ac_Q(C), i.e., the expected length of a random walk
that starts from Q and proceeds until it gets absorbed in some
node in C.

In cases where we have no reason to distinguish among the
query nodes, we consider the uniform starting probability
distribution s(q) = 1/|Q|. In fact, for simplicity of exposition,
hereinafter we focus on the case of uniform distribution.
However, we note that all our definitions and techniques
generalize naturally, not only to general starting probability
distributions s(q), but also to directed and weighted graphs.
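
The walk defined in this section can also be simulated directly, which gives a quick sanity check on any closed-form computation of the objective. The sketch below is only an illustration under stated assumptions: the graph is an adjacency dictionary, the start distribution is uniform over the query nodes, a restart counts as one step, and the names `walk_length` and `estimate_ac` are hypothetical helpers, not part of the paper.

```python
import random

def walk_length(adj, query, absorbing, alpha, rng, max_steps=1_000_000):
    """Simulate one walk and return the number of steps until absorption."""
    node = rng.choice(query)              # start node drawn uniformly from Q
    steps = 0
    while node not in absorbing and steps < max_steps:
        if rng.random() < alpha:
            node = rng.choice(query)      # restart at a query node
        else:
            node = rng.choice(adj[node])  # follow a uniformly chosen edge
        steps += 1
    return steps

def estimate_ac(adj, query, absorbing, alpha, n_walks=20_000, seed=0):
    """Monte Carlo estimate of ac_Q(C) under a uniform start distribution."""
    rng = random.Random(seed)
    total = sum(walk_length(adj, query, absorbing, alpha, rng)
                for _ in range(n_walks))
    return total / n_walks

# Path graph 0 - 1 - 2 with query node 0 and absorbing node 2.
path = {0: [1], 1: [0, 2], 2: [1]}
print(estimate_ac(path, query=[0], absorbing={2}, alpha=0.0))
```

On this path graph without restarts the exact expected length is 4, so the printed estimate should be close to that value; the estimate itself depends on the random seed and the number of simulated walks.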


IV. Absorbing random walks

In this section we review some relevant background on
absorbing random walks. Specifically, we discuss how to
calculate the objective function ac_Q(C) for Problem 1.

Let P be the transition matrix for a random walk, with
P(i, j) expressing the probability that the random walk will
move to node j given that it is currently at node i. Since
random walks can only move to absorbing nodes C, but not
away from them, we set P(c, c) = 1 and P(c, j) = 0, if j ≠ c,
for all absorbing nodes c ∈ C. The set T = V \ C of
non-absorbing nodes is called transient. If N(i) are the neighbors
of a node i ∈ T and d_i = |N(i)| its degree, the transition
probabilities from node i to other nodes are

$$P(i, j) = \begin{cases} \alpha\, s(j) & \text{if } j \in Q \setminus N(i), \\ (1 - \alpha)/d_i + \alpha\, s(j) & \text{if } j \in N(i). \end{cases}$$

Here, s represents the starting probability vector. For example,
for the uniform distribution over query nodes we have s(i) =
1/|Q| if i ∈ Q and 0 otherwise. The transition matrix of the
random walk can be written as follows

$$P = \begin{pmatrix} P_{TT} & P_{TC} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}. \tag{2}$$

In the equation above, I is an (n − |T|) × (n − |T|) identity
matrix and 0 a matrix with all its entries equal to 0; P_TT
is the |T| × |T| sub-matrix of P that contains the transition
probabilities between transient nodes; and P_TC is the |T| × |C|
sub-matrix of P that contains the transition probabilities from
transient to absorbing nodes.

The probability of the walk being on node j at exactly ℓ
steps having started at node i, is given by the (i, j)-entry of
the matrix P_TT^ℓ. Therefore, the expected total number of times
that the random walk visits node j having started from node i
is given by the (i, j)-entry of the |T| × |T| matrix

$$F = \sum_{\ell=0}^{\infty} P_{TT}^{\ell} = (I - P_{TT})^{-1}, \tag{3}$$

which is known as the fundamental matrix of the absorbing
random walk. Allowing the possibility to start the random walk
at an absorbing node (and being absorbed immediately), we
see that the expected length of a random walk that starts from
node i and gets absorbed by the set C is given by the i-th
element of the following n × 1 vector

$$L = L_C = \begin{pmatrix} F \\ \mathbf{0} \end{pmatrix} \mathbf{1}, \tag{4}$$

where 1 is a |T| × 1 vector of all 1s. We write L = L_C to
emphasize the dependence on the set of absorbing nodes C.

The expected number of steps when starting from a node
in Q and until being absorbed by some node in C is then
obtained by summing over all query nodes, i.e.,

$$\mathrm{ac}_Q(C) = s^T L_C. \tag{5}$$

A. Efficient computation of absorbing centrality

Equation (5) pinpoints the difficulty of the problem we
consider: even computing the objective function ac_Q(C) for a
candidate solution C requires an expensive matrix inversion,
F = (I − P_TT)^{-1}. Furthermore, searching for the optimal set
C involves an exponential number of candidate sets, while
evaluating each one of them requires a matrix inversion.

In practice, we find that we can compute ac_Q(C) much
faster approximately, as shown in Algorithm 1. The algorithm
follows from the infinite-sum expansion of Equation (5).

$$\mathrm{ac}_Q(C) = s^T L_C = s^T \begin{pmatrix} F \\ \mathbf{0} \end{pmatrix} \mathbf{1} = s^T \begin{pmatrix} \sum_{\ell=0}^{\infty} P_{TT}^{\ell} \\ \mathbf{0} \end{pmatrix} \mathbf{1} = \sum_{\ell=0}^{\infty} x_{\ell}\, \mathbf{1}, \tag{6}$$

with x_0 = s^T restricted to the transient nodes and
x_{ℓ+1} = x_ℓ P_TT.

Note that computing each vector x_ℓ requires time O(n²).
Algorithm 1 terminates when the increase of the sum due to
the latest term falls below a pre-defined threshold ε.
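
To make the two routes to the objective concrete, the following sketch computes ac_Q(C) once exactly, by solving the linear system (I − P_TT) L = 1 behind Equations (3) to (5), and once approximately, by accumulating the series terms until the latest one falls below a threshold. It is a minimal pure-Python illustration, not the authors' implementation: the dense list-of-lists matrices, the helper names (`build_ptt`, `solve`, `exact_ac`, `approximate_ac`), and the `max_terms` safeguard are all assumptions made for the example.

```python
def build_ptt(adj, s, alpha, absorbing):
    """Return the transient nodes and the dense matrix P_TT of Equation (2)."""
    transient = [v for v in adj if v not in absorbing]
    idx = {v: i for i, v in enumerate(transient)}
    P = [[0.0] * len(transient) for _ in transient]
    for v in transient:
        row = P[idx[v]]
        for w in adj[v]:                  # edge moves: (1 - alpha) / d_v
            if w in idx:
                row[idx[w]] += (1.0 - alpha) / len(adj[v])
        for q, prob in s.items():         # restarts: alpha * s(q)
            if q in idx:
                row[idx[q]] += alpha * prob
    return transient, P

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def exact_ac(adj, s, alpha, absorbing):
    """ac_Q(C) = s^T L_C, with L_C obtained from (I - P_TT) L = 1."""
    transient, P = build_ptt(adj, s, alpha, absorbing)
    n = len(transient)
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]
    L = solve(A, [1.0] * n)
    return sum(s.get(v, 0.0) * L[i] for i, v in enumerate(transient))

def approximate_ac(adj, s, alpha, absorbing, eps=1e-12, max_terms=100_000):
    """Truncated series: add x_l . 1 until the latest term drops below eps."""
    transient, P = build_ptt(adj, s, alpha, absorbing)
    n = len(transient)
    x = [s.get(v, 0.0) for v in transient]   # start mass on transient nodes
    delta = sum(x)
    ac = delta
    terms = 0
    while delta >= eps and terms < max_terms:
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        delta = sum(x)
        ac += delta
        terms += 1
    return ac

path = {0: [1], 1: [0, 2], 2: [1]}
print(exact_ac(path, {0: 1.0}, 0.0, {2}))        # 4.0 for this path graph
print(approximate_ac(path, {0: 1.0}, 0.0, {2}))  # close to 4.0
```

The exact route costs one cubic-time solve per candidate set, whereas each series term is a single vector-matrix product, which is the trade-off the text describes.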


V. Problem characterization 

We now study the k-ARW-Centrality problem in more
detail. In particular, we show that the function ac_Q is monotone
and supermodular, a property that is used later to provide
an approximation guarantee for the greedy algorithm. We also
show that k-ARW-Centrality is NP-hard.



Algorithm 1 ApproximateAC

Input: Transition matrix P_TT, threshold ε,
       starting probabilities s
Output: Absorbing centrality ac_Q
  x_0 ← s^T (restricted to the transient nodes)
  δ ← x_0 · 1
  ac ← δ
  ℓ ← 0
  while δ ≥ ε do
    x_{ℓ+1} ← x_ℓ P_TT
    δ ← x_{ℓ+1} · 1
    ac ← ac + δ
    ℓ ← ℓ + 1
  return ac


Recall that a function f : 2^V → ℝ over subsets of a ground
set V is submodular if it has the diminishing returns property

$$f(Y \cup \{u\}) - f(Y) \le f(X \cup \{u\}) - f(X), \tag{7}$$

for all X ⊆ Y ⊆ V and u ∉ Y. The function f is supermodular
if −f is submodular. Submodularity (and supermodularity)
is a very useful property for designing algorithms. For
instance, minimizing a submodular function is a polynomial-time
solvable problem, while the maximization problem is
typically amenable to approximation algorithms, the exact
guarantee of which depends on other properties of the function
and requirements of the problem, e.g., monotonicity, matroid
constraints, etc.

Even though the objective function ac_Q(C) is given in
closed-form by Equation (5), to prove its properties we find
it more convenient to work with its descriptive definition,
namely, ac_Q(C) being the expected length for a random walk
starting at nodes of Q before being absorbed at nodes of C.

For the rest of this section we consider that the set of query
nodes Q is fixed, and for simplicity we write ac = ac_Q.

Proposition 1 (Monotonicity) For all X ⊆ Y ⊆ V it is

$$\mathrm{ac}(Y) \le \mathrm{ac}(X).$$

The proposition states that absorption time decreases with 
more absorbing nodes. The proof is given in the Appendix. 

Next we show that the absorbing random-walk centrality
measure ac(·) is supermodular.

Proposition 2 (Supermodularity) For all sets X ⊆ Y ⊆ V
and u ∉ Y it is

$$\mathrm{ac}(X) - \mathrm{ac}(X \cup \{u\}) \ge \mathrm{ac}(Y) - \mathrm{ac}(Y \cup \{u\}). \tag{8}$$

Proof: Given an instantiation of a random walk, we define
the following propositions for any pair of nodes i, j ∈ V,
non-negative integer ℓ, and set of nodes Z:

$A^{\ell}_{i,j}(Z)$: The random walk started at node i and visited node j
after exactly ℓ steps, without visiting any node in set Z.

$B^{\ell}_{i,j}(Z, u)$: The random walk started at node i and visited
node j after exactly ℓ steps, having previously visited
node u but without visiting any node in the set Z.

It is easy to see that the set of random walks for which $A^{\ell}_{i,j}(Z)$
is true can be partitioned into those that visited u within the
first ℓ steps and those that did not. Therefore, the probability
that proposition $A^{\ell}_{i,j}(Z)$ is true for any instantiation of a
random walk generated by our model is equal to

$$\Pr[A^{\ell}_{i,j}(Z)] = \Pr[A^{\ell}_{i,j}(Z \cup \{u\})] + \Pr[B^{\ell}_{i,j}(Z, u)]. \tag{9}$$

Now, let Λ(Z) be the number of steps for a random walk
to reach the nodes in Z. Λ(Z) is a random variable and its
expected value over all random walks generated by our model
is equal to ac(Z). Note that the proposition Λ(Z) ≥ ℓ + 1
is true for a given instantiation of a random walk only if
there is a pair of nodes q ∈ Q and j ∈ V \ Z, for which the
proposition $A^{\ell}_{q,j}(Z)$ is true. Therefore,

$$\Pr[\Lambda(Z) \ge \ell + 1] = \sum_{q \in Q} \sum_{j \in V \setminus Z} \Pr[A^{\ell}_{q,j}(Z)]. \tag{10}$$

From the above, it is easy to calculate ac(Z) as

$$\begin{aligned}
\mathrm{ac}(Z) = E[\Lambda(Z)] &= \sum_{\ell=0}^{\infty} \ell\, \Pr[\Lambda(Z) = \ell] \\
&= \sum_{\ell=1}^{\infty} \Pr[\Lambda(Z) \ge \ell] \\
&= \sum_{\ell=0}^{\infty} \Pr[\Lambda(Z) \ge \ell + 1] \\
&= \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus Z} \Pr[A^{\ell}_{q,j}(Z)].
\end{aligned} \tag{11}$$

The final property we will need is the observation that, for
X ⊆ Y, $B^{\ell}_{i,j}(Y, u)$ implies $B^{\ell}_{i,j}(X, u)$ and thus

$$\Pr[B^{\ell}_{i,j}(X, u)] \ge \Pr[B^{\ell}_{i,j}(Y, u)]. \tag{12}$$

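By using Equation (11), the Inequality (8) can be rewritten as

$$\begin{aligned}
&\sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus (X \cup \{u\})} \Pr[A^{\ell}_{q,j}(X \cup \{u\})] \\
&\quad \ge \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus (Y \cup \{u\})} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})].
\end{aligned} \tag{13}$$

We only need to show that the inequality holds for an arbitrary
value of ℓ and q ∈ Q, that is

$$\begin{aligned}
&\sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{j \in V \setminus (X \cup \{u\})} \Pr[A^{\ell}_{q,j}(X \cup \{u\})] \\
&\quad \ge \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{j \in V \setminus (Y \cup \{u\})} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})].
\end{aligned} \tag{14}$$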

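Notice that $\Pr[A^{\ell}_{q,u}(Y \cup \{u\})] = 0$ (and likewise with X in
place of Y), so we can rewrite the above inequality as

$$\begin{aligned}
&\sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X \cup \{u\})] \\
&\quad \ge \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})].
\end{aligned} \tag{15}$$

To show the latter inequality we start from the left hand side
and use Inequality (12). We have

$$\begin{aligned}
\sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X \cup \{u\})]
&= \sum_{j \in V \setminus X} \Pr[B^{\ell}_{q,j}(X, u)] \\
&\ge \sum_{j \in V \setminus Y} \Pr[B^{\ell}_{q,j}(Y, u)] \\
&= \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})],
\end{aligned}$$

where the two equalities follow from Equation (9) and the
inequality from (12) together with V \ Y ⊆ V \ X,
which completes the proof. ■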
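
Propositions 1 and 2 can be checked numerically on a toy instance by enumerating all non-empty absorbing sets. The sketch below is an illustration only: the five-node graph, the query set, and the helper names `ac` and `check_properties` are made up for the example, and the objective is evaluated by a simple fixed-point iteration on L(v) = 1 + Σ_w P(v, w) L(w) rather than by the matrix formulas of the paper.

```python
from itertools import combinations

def ac(adj, query, absorbing, alpha=0.15, sweeps=1000):
    """Expected absorption time for a uniform start over the query nodes.

    Iterates L(v) = 1 + sum_w P(v, w) L(w) on transient nodes,
    keeping L = 0 on absorbing nodes.
    """
    s = 1.0 / len(query)
    L = {v: 0.0 for v in adj}
    for _ in range(sweeps):
        restart = alpha * sum(s * L[q] for q in query)
        L = {v: 0.0 if v in absorbing else
                1.0 + restart + (1 - alpha) * sum(L[w] for w in adj[v]) / len(adj[v])
             for v in adj}
    return sum(s * L[q] for q in query)

def check_properties(adj, query, tol=1e-9):
    """Brute-force test of monotonicity and supermodularity of ac."""
    nodes = sorted(adj)
    subsets = [frozenset(c) for r in range(1, len(nodes) + 1)
               for c in combinations(nodes, r)]
    value = {S: ac(adj, query, S) for S in subsets}
    monotone = supermodular = True
    for X in subsets:
        for Y in subsets:
            if not X <= Y:
                continue
            monotone &= value[Y] <= value[X] + tol
            for u in nodes:
                if u in Y:
                    continue
                gain_x = value[X] - value[X | {u}]   # left side of (8)
                gain_y = value[Y] - value[Y | {u}]   # right side of (8)
                supermodular &= gain_x >= gain_y - tol
    return monotone, supermodular

graph = {0: [1, 4], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [0, 3]}
print(check_properties(graph, query=[0, 2, 4]))
```

The empty set is excluded because the expected absorption time is unbounded when no node is absorbing.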

Finally, we establish the hardness of k absorbing centrality, 
defined in Problem 1. 

Theorem 1 The k-ARW-Centrality problem is NP-hard.

Proof: We obtain a reduction from the VertexCover
problem [6]. An instance of the VertexCover problem is
specified by a graph G = (V, E) and an integer k, and asks
whether there exists a set of nodes C ⊆ V such that |C| ≤ k
and C is a vertex cover (i.e., for every (i, j) ∈ E it is
{i, j} ∩ C ≠ ∅). Let |V| = n.

Given an instance of the VertexCover problem, we
construct an instance of the decision version of
k-ARW-Centrality by taking the same graph G = (V, E) with
query nodes Q = V and asking whether there is a set of
absorbing nodes C such that |C| ≤ k and ac_Q(C) ≤ 1 − k/n.
We will show that C is a solution for VertexCover if
and only if ac_Q(C) ≤ 1 − k/n.

Assume first that C is a vertex cover. Consider a random
walk starting uniformly at random from a node v ∈ Q = V.
If v ∈ C then the length of the walk will be 0, as the walk
will be absorbed immediately. This happens with probability
|C|/|V| = k/n. Otherwise, if v ∉ C the length of the walk
will be 1, as the walk will be absorbed in the next step (since
C is a vertex cover all the neighbors of v need to belong in
C). This happens with the rest of the probability 1 − k/n.

Thus, the expected length of the random walk is

$$\mathrm{ac}_Q(C) = 0 \cdot \frac{k}{n} + 1 \cdot \left(1 - \frac{k}{n}\right) = 1 - \frac{k}{n}. \tag{16}$$

Conversely, assume that C is not a vertex cover for G. Then,
there should be an uncovered edge (u, v). A random walk that
starts in u and then goes to v (or starts in v and then goes to u)
will have length at least 2, and this happens with strictly
positive probability. Then, following a similar reasoning as
in the previous case, we have

$$\begin{aligned}
\mathrm{ac}_Q(C) &= \sum_{\ell=0}^{\infty} \ell\, \Pr(\text{absorbed in exactly } \ell \text{ steps}) \\
&= \sum_{\ell=1}^{\infty} \Pr(\text{absorbed after at least } \ell \text{ steps}) \\
&\ge \left(1 - \frac{k}{n}\right) + \Pr(\text{absorbed after at least 2 steps}) > 1 - \frac{k}{n}.
\end{aligned} \tag{17}$$
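
The two directions of the reduction can be observed on a small example. The sketch below assumes that restarts are switched off (α = 0), which the argument above relies on implicitly but which is an assumption of this illustration; the four-node path graph and the helper names `ac_uniform` and `is_vertex_cover` are hypothetical.

```python
def ac_uniform(adj, absorbing, sweeps=500):
    """ac_Q(C) for Q = V, uniform start, and no restarts (alpha = 0)."""
    L = {v: 0.0 for v in adj}
    for _ in range(sweeps):
        # Fixed-point iteration on L(v) = 1 + average of L over the neighbors of v.
        L = {v: 0.0 if v in absorbing else
                1.0 + sum(L[w] for w in adj[v]) / len(adj[v])
             for v in adj}
    return sum(L.values()) / len(adj)

def is_vertex_cover(adj, C):
    """True if every edge has at least one endpoint in C."""
    return all(u in C or v in C for u in adj for v in adj[u])

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
n, k = len(path), 2
for C in ({1, 2}, {0, 3}):
    # Columns: candidate set, is it a cover, ac_Q(C), threshold 1 - k/n.
    print(sorted(C), is_vertex_cover(path, C), ac_uniform(path, C), 1 - k / n)
```

For the cover {1, 2} the objective meets the threshold 1 − k/n exactly, while for the non-cover {0, 3} walks can bounce along the uncovered edge (1, 2) and the objective exceeds the threshold.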


VI. Algorithms 

This section presents algorithms to solve the
k-ARW-Centrality problem. In all cases, the set of query nodes
Q ⊆ V is given as input, along with a set of candidate nodes
D ⊆ V and the restart probability α.


A. Greedy approach 

The first algorithm is a standard greedy algorithm, denoted
Greedy, which exploits the supermodularity of the absorbing
random-walk centrality measure. It starts with the result set C
equal to the empty set, and iteratively adds a node from the
set of candidate nodes D, until k nodes are added. In each
iteration the node added in the set C is the one that brings the
largest improvement to ac_Q.
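
The selection rule just described can be written down in a few lines if efficiency is ignored. The sketch below is a naive version that re-evaluates the objective from scratch for every candidate, so it does not reflect the faster updates discussed later in this section; the barbell-shaped example graph and the helper names `ac` and `greedy` are assumptions made for illustration.

```python
def ac(adj, query, absorbing, alpha=0.15, sweeps=1000):
    """Expected absorption time for a uniform start over the query nodes."""
    s = 1.0 / len(query)
    L = {v: 0.0 for v in adj}
    for _ in range(sweeps):
        restart = alpha * sum(s * L[q] for q in query)
        L = {v: 0.0 if v in absorbing else
                1.0 + restart + (1 - alpha) * sum(L[w] for w in adj[v]) / len(adj[v])
             for v in adj}
    return sum(s * L[q] for q in query)

def greedy(adj, query, candidates, k, alpha=0.15):
    """Naive greedy: add, k times, the candidate that minimizes ac_Q."""
    chosen = []
    for _ in range(k):
        best = min((u for u in candidates if u not in chosen),
                   key=lambda u: ac(adj, query, set(chosen) | {u}, alpha))
        chosen.append(best)
    return chosen

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
           3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
nodes = sorted(barbell)
print(greedy(barbell, query=nodes, candidates=nodes, k=2))
```

On this graph the two selected nodes fall in different triangles, which illustrates the preference of the measure for diverse absorbing sets.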

As shown before, the objective function to be minimized,
i.e., ac_Q, is supermodular and monotonically decreasing. The
Greedy algorithm is not an approximation algorithm for this
minimization problem. However, it can be shown to provide
an approximation guarantee for maximizing the absorbing
centrality gain measure, defined below.

Definition 1 (Absorbing centrality gain) Given a graph G,
a set of query nodes Q, and a set of candidate nodes D, the
absorbing centrality gain of a set of nodes C ⊆ D is defined
as

$$\mathrm{acg}_Q(C) = m_Q - \mathrm{ac}_Q(C),$$

where m_Q = min_{v ∈ D} {ac_Q({v})}.

Justification of the gain function. The reason to define
the absorbing centrality gain is to turn our problem into a
submodular-maximization problem so that we can apply standard
approximation-theory results and show that the greedy
algorithm provides a constant-factor approximation guarantee.
The shift m_Q quantifies the absorbing centrality of the best
single node in the candidate set. Thus, the value of acg_Q(C)
expresses how much we gain in expected random-walk length
when we use the set C as absorbing nodes compared to when
we use the best single node. Our goal is to maximize this gain.

Observe that the gain function acg_Q is not non-negative
everywhere. Take for example any node u such that ac_Q({u}) >
m_Q. Then, acg_Q({u}) < 0. Note also that we could have
obtained a non-negative gain function by defining gain with
respect to the worst single node, instead of the best. In other
words, the gain function acg′_Q(C) = M_Q − ac_Q(C), with
M_Q = max_{v ∈ D} {ac_Q({v})}, is non-negative everywhere.

Nevertheless, the reason we use the gain function acg_Q
instead of acg′_Q is that acg′_Q takes much larger values than
acg_Q, and thus, a multiplicative approximation guarantee on
acg′_Q is a weaker result than a multiplicative approximation
guarantee on acg_Q. On the other hand, our definition of
acg_Q creates a technical difficulty with the approximation
guarantee, which is defined for non-negative functions. Luckily,
this difficulty can be overcome easily by noting that, due to the
monotonicity of acg_Q, for any k ≥ 1, the value of acg_Q at
the optimal solution, as well as at the solution returned by
Greedy, is non-negative.

Approximation guarantee. The fact that the Greedy algorithm
gives an approximation guarantee to the problem of
maximizing absorbing centrality gain is a standard result from
the theory of submodular functions.

Proposition 3 The function acg_Q is monotonically increasing,
and submodular.

Proposition 4 Let k ≥ 1. For the problem of finding a set
C ⊆ D with |C| ≤ k, such that acg_Q(C) is maximized, the
Greedy algorithm gives a (1 − 1/e)-approximation guarantee.

We now discuss the complexity of the Greedy algorithm.
A naive implementation requires computing the absorbing
centrality ac_Q(C) using Equation (5) for each set C that
needs to be evaluated during the execution of the algorithm.
However, applying Equation (5) involves a matrix inversion,
which is a very expensive operation. Furthermore, the number
of times that we need to evaluate ac_Q(C) is O(k|D|), as
for each iteration of the greedy we need to evaluate the
improvement over the current set of each of the O(|D|)
candidates. The number of candidates can be very large, e.g.,
|D| = n, yielding an O(kn⁴) algorithm when each inversion
takes O(n³) time, which is prohibitively expensive.

We can show, however, that we can execute Greedy
significantly more efficiently. Specifically, we can prove the
following two propositions.

Proposition 5 Let C_{i−1} be a set of i − 1 absorbing nodes,
P_{i−1} the corresponding transition matrix, and let F_{i−1} =
(I − P_{i−1})^{-1}. Let C_i = C_{i−1} ∪ {u}. Given F_{i−1} the value
ac_Q(C_i) can be computed in O(n²).

Proposition 6 Let C be a set of absorbing nodes, P the
corresponding transition matrix, and F = (I − P)^{-1}. Let
C′ = C − {v} ∪ {u}, with v ∈ C and u ∉ C. Given F the value
ac_Q(C′) can be computed in time O(n²).

The proofs of these two propositions can be found in the
Appendix. Proposition 5 implies that in order to compute ac_Q(C_i)
for absorbing nodes C_i in O(n²), it is enough to maintain the
matrix F_{i−1}, computed in the previous step of the greedy
algorithm for absorbing nodes C_{i−1}. Proposition 6, on the other
hand, implies that we can compute the absorbing centrality
of each set of absorbing nodes of a fixed size i in O(n²),
given the matrix F, which is computed for one arbitrary set of
absorbing nodes C of size i. Combined, the two propositions
above yield a greedy algorithm that runs in O(kn³) and offers
the approximation guarantee discussed above. We outline it as
Algorithm 2.

Algorithm 2 Greedy

Input: graph G, query nodes Q, candidates D, k ≥ 1
Output: a set of k nodes C
  Compute ac_Q({v}) for arbitrary v ∈ D
  For each u ∈ (D − {v}), use Prop. 6 to compute ac_Q({u})
  Select u_1 ∈ D s.t. u_1 ← argmin_{u ∈ D} ac_Q({u})
  Initialize solution C ← {u_1}
  for i = 2..k do
    For each u ∈ D, use Prop. 5 to compute ac_Q(C ∪ {u})
    Select u_i ∈ D s.t. u_i ← argmin_{u ∈ (D − C)} ac_Q(C ∪ {u})
    Update solution C ← C ∪ {u_i}
  return C
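
The kind of update behind Proposition 5 can be illustrated with the Sherman-Morrison formula. The sketch below is one possible realization under explicit assumptions, not the construction from the Appendix: P is taken as the full n × n matrix in which the rows of absorbing nodes are set to zero, so that making node u absorbing is the rank-one change (I − P) + e_u p_uᵀ, where p_u is the old row u of P. All helper names are hypothetical.

```python
def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(A)
    M = [A[i][:] + [1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [x / piv for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def transition_matrix(adj, nodes, s, alpha, absorbing):
    """Full matrix P; rows of absorbing nodes are zero (the walk stops there)."""
    idx = {v: i for i, v in enumerate(nodes)}
    P = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        if v in absorbing:
            continue
        for w in adj[v]:
            P[idx[v]][idx[w]] += (1.0 - alpha) / len(adj[v])
        for q, prob in s.items():
            P[idx[v]][idx[q]] += alpha * prob
    return P

def fundamental(P):
    n = len(P)
    return inverse([[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)]
                    for i in range(n)])

def add_absorbing(F, P, u):
    """Sherman-Morrison update of F = (I - P)^-1 after zeroing row u of P."""
    n = len(F)
    col = [F[i][u] for i in range(n)]                                   # F e_u
    row = [sum(P[u][k] * F[k][j] for k in range(n)) for j in range(n)]  # p_u^T F
    denom = 1.0 + sum(P[u][k] * col[k] for k in range(n))
    return [[F[i][j] - col[i] * row[j] / denom for j in range(n)] for i in range(n)]

def ac_from_F(F, s_vec):
    """Row sums of F count the single visit to an absorbing node, hence the -1."""
    return sum(s_i * (sum(F_i) - 1.0) for s_i, F_i in zip(s_vec, F))

graph = {0: [1, 4], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [0, 3]}
nodes = sorted(graph)
s = {0: 1 / 3, 2: 1 / 3, 4: 1 / 3}
P1 = transition_matrix(graph, nodes, s, 0.15, {0})
F1 = fundamental(P1)                     # one full inversion
F2 = add_absorbing(F1, P1, 3)            # O(n^2) update for C = {0, 3}
print(ac_from_F(F2, [s.get(v, 0.0) for v in nodes]))
```

Only the first fundamental matrix needs a full inversion in this sketch; each later addition of an absorbing node costs a quadratic number of operations.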

Practical speed-up. We found that the following heuristic lets
us speed up Greedy even further, with no significant loss in
the quality of results. To select the first node for the solution
set C (see Algorithm 2), we calculate the PageRank values of
all nodes in D and evaluate ac_Q only for the t ≪ k nodes
with highest PageRank score, where t is a fixed parameter.
In what follows, we will be using this heuristic version of
Greedy, unless explicitly stated otherwise.

B. Efficient heuristics 

Even though Greedy runs in polynomial time, it can be
quite inefficient when employed on moderately sized datasets
(more than some tens of thousands of nodes). We thus describe
algorithms that we study as efficient heuristics for the problem.
These algorithms offer no guarantee on their performance.

Spectral methods have been used extensively for the problem
of graph partitioning. Motivated by the wide applicability
of this family of algorithms, here we explore three spectral
algorithms: SpectralQ, SpectralC, and SpectralD. We start with
a brief overview of the spectral method; a comprehensive
presentation can be found in the tutorial by von Luxburg [17].

The main idea of spectral approaches is to project the
original graph into a low-dimensional Euclidean space so
that distances between nodes in the graph correspond to
Euclidean distances between the corresponding projected points.
A standard spectral embedding method, proposed by Shi
and Malik [15], uses the “random-walk” Laplacian matrix
L_G = I − D^{-1}A of a graph G, where A is the adjacency
matrix of the graph and D here denotes the diagonal degree
matrix (not the candidate set), and forms the matrix
U = [u_2, …, u_{d+1}]
whose columns are the eigenvectors of L_G that correspond to
the smallest eigenvalues λ_2 ≤ … ≤ λ_{d+1}, with d being the
target dimension of the projection. The spectral embedding is
then defined by mapping the i-th node of the graph to a point
in ℝ^d, which is the i-th row of the matrix U.

The algorithms we explore are adaptations of the spectral
method. They all start by computing the spectral embedding
φ : V → ℝ^d described above, and then proceed as follows:

SpectralQ performs k-means clustering on the embeddings of
the query nodes, where k is the desired size of the result set.
Subsequently, it selects candidate nodes that are close to the
computed centroids. Specifically, if s_i is the size of the i-th
cluster, then k_i candidate nodes are selected whose embedding
is the nearest to the i-th centroid. The number k_i is selected
so that k_i ∝ s_i and Σ_i k_i = k.

SpectralC is similar to SpectralQ, but it performs the k-means
clustering on the embeddings of the candidate nodes, instead
of the query nodes.

SpectralD performs k-means clustering on the embeddings of
the query nodes, where k is the desired result-set size. Then,
it selects the k candidate nodes whose embeddings minimize
the sum of squared ℓ2-distances from the centroids, with no
consideration of the relative sizes of the clusters.

Personalized PageRank (PPR). This is the standard PageRank [4]
algorithm with a damping factor equal to the restart
probability α of the random walk and personalization
probabilities equal to the start probabilities s(q). Algorithm PPR
returns the k nodes with highest PageRank values.
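
A minimal power-iteration sketch of this baseline follows. It assumes the convention π = α s + (1 − α) π W, where W is the random-walk matrix of the graph and α the restart probability; libraries that expose a "damping factor" often mean 1 − α instead, so the parameter should be checked against the convention in use. The function names and the star-graph example are hypothetical.

```python
def personalized_pagerank(adj, s, alpha=0.15, iters=200):
    """Power iteration for pi = alpha * s + (1 - alpha) * pi * W."""
    pi = {v: s.get(v, 0.0) for v in adj}
    for _ in range(iters):
        nxt = {v: alpha * s.get(v, 0.0) for v in adj}   # restart mass
        for v, mass in pi.items():
            share = (1.0 - alpha) * mass / len(adj[v])  # mass sent along each edge
            for w in adj[v]:
                nxt[w] += share
        pi = nxt
    return pi

def ppr_top_k(adj, query, k, alpha=0.15):
    """Return the k nodes with the highest personalized PageRank score."""
    s = {q: 1.0 / len(query) for q in query}
    pi = personalized_pagerank(adj, s, alpha)
    return sorted(adj, key=lambda v: pi[v], reverse=True)[:k]

# Star graph: center 0, leaves 1..4; the leaves are the query nodes.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(ppr_top_k(star, query=[1, 2, 3, 4], k=1))   # [0]
```

The center is returned even though it is not a query node, because every walk leaving a leaf passes through it.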

Degree and distance centrality. Finally, we consider the
standard degree and distance centrality measures.

Degree returns the k highest-degree nodes. Note that this
baseline is oblivious to the query nodes.

Distance returns the k nodes with highest distance centrality
with respect to Q. The distance centrality of a node u is
defined as $\mathrm{dc}(u) = \left(\sum_{v \in Q} d(u, v)\right)^{-1}$.
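
Both baselines are straightforward to express in code. The sketch below assumes an unweighted graph stored as an adjacency dictionary, uses breadth-first search for the shortest-path distances d(u, v), and treats unreachable query nodes as infinitely far; the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def degree_top_k(adj, k):
    """The k highest-degree nodes; ignores the query nodes entirely."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]

def distance_top_k(adj, query, k):
    """The k nodes with the highest dc(u) = 1 / sum of distances to Q."""
    total = {v: 0 for v in adj}
    for q in query:                       # one BFS per query node
        dist = bfs_distances(adj, q)
        for v in adj:
            total[v] += dist.get(v, float("inf"))
    dc = {v: (1.0 / t if t > 0 else float("inf")) for v, t in total.items()}
    return sorted(adj, key=lambda v: dc[v], reverse=True)[:k], dc

# Path graph 0 - 1 - 2 - 3 - 4 with query nodes {0, 1, 2}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
top, dc = distance_top_k(path, query=[0, 1, 2], k=1)
print(top, dc[1])   # [1] 0.5
```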

VII. Experimental evaluation 

A. Datasets 

We evaluate the algorithms described in Section VI on two 
sets of real graphs: one set of small graphs that allows us 
to compare the performance of the fast heuristics against the 
greedy approach; and one set of larger graphs, to compare the 
performance of the heuristics against each other on datasets 
of larger scale. Note that the bottleneck of the computation 
lies in the evaluation of centrality. Even though the technique 
we describe in Section IV-A allows it to scale to datasets 
of tens of thousands of nodes on a single processor, it is 
still prohibitively expensive for massive graphs. Nevertheless, our 
experimentation allows us to discover the traits of the different 
algorithms and understand what performance to anticipate 
when they are employed on graphs of massive size. 

The datasets are listed in Table I. Small graphs are obtained 
from Mark Newman’s repository, larger graphs from SNAP. 

Mark Newman’s repository: http://www-personal.umich.edu/%7Emejn/netdata/ 

SNAP: http://snap.stanford.edu/data/index.html 


TABLE I: Dataset statistics 

Dataset           |V|      |E| 
karate             34       78 
dolphins           62      159 
lesmis             77      254 
adjnoun           112      425 
football          115      613 
kddCoauthors    2 891    2 891 
livejournal     3 645    4 141 
ca-GrQc         5 242   14 496 
ca-HepTh        9 877   25 998 
roadnet        10 199   13 932 
oregon-1       11 174   23 409 


For kddCoauthors, livejournal, and roadnet we use 
samples of the original datasets. In the interest of repeatability, 
our code and datasets are made publicly available. 

B. Evaluation Methodology 

Each experiment in our evaluation framework is defined 
by a graph G, a set of query nodes Q, a set of candidate 
nodes D, and an algorithm to solve the problem. We evaluate 
all algorithms presented in Section VI. For the set of candidate 
nodes D, we consider two cases: it is equal to either the set of 
query nodes, i.e., D = Q, or the set of all nodes, i.e., D = V. 

Query nodes Q are selected randomly, using the following 
process: First, we select a set S of s seed nodes, uniformly 
at random among all nodes. Then, we select a ball B(v, r) 
of predetermined radius r = 2 around each seed $v \in S$. 
Finally, from all balls, we select a set of query nodes Q of 
predetermined size q, with q = 10 and q = 20, respectively, 
for the small and larger datasets. Selection is done uniformly 
at random. 

Finally, the restart probability $\alpha$ is set to $\alpha = 0.15$ and the 
starting probabilities s are uniform over Q. 
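
The query-selection process described above can be sketched as follows. The function names and the adjacency-dictionary representation are assumptions made for illustration; in particular, the text does not say what happens when the balls contain fewer than q nodes, so the fallback below (use all of them) is our own choice.

```python
import random
from collections import deque


def ball(adj, center, radius):
    """Nodes within `radius` hops of `center` (BFS truncated at `radius`)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the ball boundary
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def sample_queries(adj, num_seeds, radius, num_queries, rng):
    """Pick seeds uniformly, then sample query nodes from the union of their balls."""
    seeds = rng.sample(sorted(adj), num_seeds)
    pool = set()
    for v in seeds:
        pool |= ball(adj, v, radius)
    pool = sorted(pool)
    # Assumption: if the balls hold fewer than num_queries nodes, use them all.
    queries = rng.sample(pool, min(num_queries, len(pool)))
    start = {v: 1.0 / len(queries) for v in queries}  # uniform start probabilities
    return seeds, queries, start


# Path graph on 10 nodes, s = 2 seeds, r = 2, q = 3.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
seeds, queries, start = sample_queries(adj, 2, 2, 3, random.Random(0))
```
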

C. Implementation 

All algorithms are implemented in Python using the NetworkX 
package [8], and were run on an Intel Xeon 2.83GHz 
with 32GB RAM. 

D. Results 

Figure 1 shows the centrality scores achieved by different 
algorithms on the small graphs for varying k (note: lower is 
better). We present two settings: on the left, the candidates are 
all nodes (D = V), and on the right, the candidates are only 
the query nodes (D = Q). We observe that PPR closely tracks 
the quality of solutions returned by Greedy, while Degree and 
Distance often come close to it. The spectral algorithms do not 
perform as well. 

Figure 2 is similar to Figure 1, but results on the larger 
datasets are shown, not including Greedy. When all nodes are 
candidates, PPR typically has the best performance, followed 
by Distance, while Degree is unreliable. The spectral algorithms 
typically perform worse than PPR. 

Code and datasets: https://github.com/harrymvr/absorbing-centrality 

For the planar roadnet dataset we use r = 3. 



When only query nodes are candidates, all algorithms 
demonstrate similar performance, which is most typically 
worse than the performance of PPR (the best performing 
algorithm) in the previous setting. Both observations can be 
explained by the fact that the selection is very restricted by 
the requirement D = Q, and there is not much flexibility for 
the best performing algorithms to produce a better solution. 

In terms of running time on the larger graphs, Distance 
returns within a few minutes (with observed times between 15 
seconds and 5 minutes) while Degree returns within seconds (all 
observed times were less than 1 minute). Finally, even though 
Greedy returns within 1-2 seconds for the small datasets, it 
does not scale well to the larger datasets (its running time is 
orders of magnitude worse than that of the heuristics, so it is not 
included in the experiments). 

Based on the above, we conclude that PPR offers the best 
trade-off of quality versus running time for datasets of at least 
moderate size (more than 10k nodes). 

VIII. Conclusions 

In this paper, we have addressed the problem of finding 
central nodes in a graph with respect to a set of query nodes Q. 
Our measure is based on absorbing random walks: we seek 
to compute k nodes that minimize the expected number of 
steps that a random walk needs to reach (and be 
“absorbed” by) one of them when it starts from the query nodes. We have 
shown that the problem is NP-hard and described an $O(kn^3)$ 
greedy algorithm to solve it approximately. Moreover, we 
experimented with heuristic algorithms to solve the problem on 
large graphs. Our results show that, in practice, personalized 
PageRank offers a good combination of quality and speed. 

Appendix 

A. Proposition 1 

Proposition (Monotonicity) For all $X \subseteq Y \subseteq V$ it is 

$$ac(Y) \le ac(X).$$ 

Proof: Write $G_X$ for the input graph G where the nodes in X 
are absorbing. Define $G_Y$ similarly. Let $Z = Y \setminus X$. 
Consider a path p in $G_X$ drawn from the distribution induced 
by the random walks on $G_X$. Let $\Pr[p]$ be the probability of 
the path and $\ell(p)$ its length. Let $\mathcal{P}(X)$ and $\mathcal{P}(Y)$ be the sets 
of paths on $G_X$ and $G_Y$. Finally, let $\mathcal{P}(Z, X)$ be the set of 
paths on $G_X$ that pass through Z, and $\bar{\mathcal{P}}(Z, X)$ the set of paths 
on $G_X$ that do not pass through Z. We have 

$$
\begin{aligned}
ac(X) &= \sum_{p \in \mathcal{P}(X)} \Pr[p]\,\ell(p) \\
      &= \sum_{p \in \mathcal{P}(Z,X)} \Pr[p]\,\ell(p) + \sum_{p \in \bar{\mathcal{P}}(Z,X)} \Pr[p]\,\ell(p) \\
      &\ge \sum_{p \in \mathcal{P}(Y)} \Pr[p]\,\ell(p) \\
      &= ac(Y),
\end{aligned}
$$

where the inequality comes from the fact that a path in $G_X$ 
passing through Z and being absorbed by X corresponds to a 
shorter path in $G_Y$ being absorbed by Y. ■ 
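
The monotonicity property can be checked numerically on a toy graph. The sketch below makes simplifying assumptions that are not part of the paper's setting: it uses a simple random walk without restarts on a connected graph, and it computes expected absorption times by fixed-point iteration rather than by matrix inversion. The function name `absorption_time` is hypothetical.

```python
def absorption_time(adj, absorbing, start, sweeps=2000):
    """Expected number of steps until a simple random walk hits `absorbing`.

    Iterates h(u) = 1 + average of h over the neighbours of u, with h = 0
    on absorbing nodes. Assumes a connected graph and no restarts.
    """
    h = {u: 0.0 for u in adj}
    for _ in range(sweeps):
        h = {u: 0.0 if u in absorbing
             else 1.0 + sum(h[v] for v in adj[u]) / len(adj[u])
             for u in adj}
    # Average over the start distribution of the query nodes.
    return sum(p * h[u] for u, p in start.items())


# Path 0 - 1 - 2 - 3 - 4 with query nodes {0, 4} and uniform start probabilities.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
start = {0: 0.5, 4: 0.5}
ac_X = absorption_time(path, {2}, start)      # X = {2}
ac_Y = absorption_time(path, {2, 4}, start)   # Y = {2, 4}, a superset of X
```

On this example the walk from either end of the path needs 4 expected steps to reach node 2, so adding node 4 to the absorbing set halves the score.
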

B. Proposition 5 

Proposition Let $C_{i-1}$ be a set of $i - 1$ absorbing nodes, 
$P_{i-1}$ the corresponding transition matrix, and $F_{i-1} = (I - P_{i-1})^{-1}$. 
Let $C_i = C_{i-1} \cup \{u\}$. Given $F_{i-1}$, the centrality 
score $ac_Q(C_i)$ can be computed in time $O(n^2)$. 

The proof makes use of the following lemma. 

Lemma 1 (Sherman-Morrison Formula [7]) Let M be a 
square $n \times n$ invertible matrix and $M^{-1}$ its inverse. Moreover, 
let a and b be any two column vectors of size n. Then, the 
following equation holds 

$$(M + ab^T)^{-1} = M^{-1} - M^{-1} a b^T M^{-1} / (1 + b^T M^{-1} a).$$ 
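
As a quick sanity check of Lemma 1, the following sketch compares the rank-one update against a direct inversion on a random matrix. NumPy is an assumption here (the paper only states that Python and NetworkX were used), and the test matrix is made strongly diagonally dominant so that both M and $M + ab^T$ are guaranteed to be invertible.

```python
import numpy as np


def sherman_morrison(M_inv, a, b):
    """Return (M + a b^T)^{-1} given M^{-1}, using O(n^2) operations."""
    Ma = M_inv @ a   # column vector M^{-1} a
    bM = b @ M_inv   # row vector b^T M^{-1}
    return M_inv - np.outer(Ma, bM) / (1.0 + b @ Ma)


rng = np.random.default_rng(0)
n = 5
# Diagonal entries of at least 2n dominate the off-diagonal row sums,
# both before and after adding a b^T (all random entries lie in [0, 1)).
M = 2 * n * np.eye(n) + rng.random((n, n))
a, b = rng.random(n), rng.random(n)

updated = sherman_morrison(np.linalg.inv(M), a, b)
direct = np.linalg.inv(M + np.outer(a, b))
```

The parenthesization matters for the cost: forming the two vectors first and then their outer product keeps every step at $O(n^2)$, whereas multiplying the matrices left to right would cost $O(n^3)$.
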

Proof: Without loss of generality, let the set of absorbing 
nodes be $C_{i-1} = \{1, 2, \ldots, i-1\}$. As in Section VI, the 
expected number of steps before absorption is given by the 
formulas 

$$ac_Q(C_{i-1}) = s^T F_{i-1} \mathbf{1},$$ 

with $F_{i-1} = A_{i-1}^{-1}$ and $A_{i-1} = I - P_{i-1}$. 

We proceed to show how to increase the set of absorbing nodes 
by one and calculate the new absorption time by updating $F_{i-1}$ 
in $O(n^2)$. Without loss of generality, suppose we add node i 
to the absorbing nodes $C_{i-1}$, so that 

$$C_i = C_{i-1} \cup \{i\} = \{1, 2, \ldots, i-1, i\}.$$ 

Let $P_i$ be the transition matrix over G with absorbing nodes 
$C_i$. Like before, the expected absorption time by nodes $C_i$ is 
given by the formulas 

$$ac_Q(C_i) = s^T F_i \mathbf{1},$$ 

with $F_i = A_i^{-1}$ and $A_i = I - P_i$. 

Notice that 

$$
A_i - A_{i-1} = (I - P_i) - (I - P_{i-1}) = P_{i-1} - P_i =
\begin{bmatrix}
0_{(i-1) \times n} \\
p_{i,1} \;\cdots\; p_{i,n} \\
0_{(n-i) \times n}
\end{bmatrix}
= ab^T,
$$

where $p_{i,j}$ denotes the transition probability from node i to 
node j in transition matrix $P_{i-1}$, and the column-vectors a 
and b are defined as 

$$
a = [\,\underbrace{0 \;\cdots\; 0}_{i-1} \;\; 1 \;\; \underbrace{0 \;\cdots\; 0}_{n-i}\,]^T, \quad \text{and} \quad
b = [\,p_{i,1} \;\cdots\; p_{i,n}\,]^T.
$$

By a direct application of Lemma 1, it is easy to see that we 
can compute $F_i$ from $F_{i-1}$ with the following formula, at a 
cost of $O(n^2)$ operations. 

$$F_i = F_{i-1} - (F_{i-1} a)(b^T F_{i-1}) / (1 + b^T (F_{i-1} a))$$ 

We have thus shown that, given $F_{i-1}$, we can compute $F_i$, 
and therefore $ac_Q(C_i)$ as well, in $O(n^2)$. ■ 

C. Proposition 6 

Proposition Let C be a set of absorbing nodes, P the 
corresponding transition matrix, and $F = (I - P)^{-1}$. Let 
$C' = C - \{v\} \cup \{u\}$, for $v \in C$ and $u \notin C$. Given F, the centrality 
score $ac_Q(C')$ can be computed in time $O(n^2)$. 

Proof: The proof is similar to the proof of Proposition 5. 
Without loss of generality, let the two sets of absorbing nodes 
be 

$$C = \{1, 2, \ldots, i-1, i\} \quad \text{and} \quad C' = \{1, 2, \ldots, i-1, i+1\}.$$ 

Let P' be the transition matrix with absorbing nodes C'. The 
absorbing centrality for the two sets of absorbing nodes C and 
C' is expressed as a function of the following two matrices 

$$F = A^{-1}, \text{ with } A = I - P, \text{ and}$$ 

$$F' = A'^{-1}, \text{ with } A' = I - P'.$$ 

Notice that 

$$
A' - A = (I - P') - (I - P) = P - P' =
\begin{bmatrix}
0_{(i-1) \times n} \\
-p_{i,1} \;\cdots\; -p_{i,n} \\
p_{i+1,1} \;\cdots\; p_{i+1,n} \\
0_{(n-i-1) \times n}
\end{bmatrix}
= a_2 b_2^T - a_1 b_1^T,
$$

where $p_{i,j}$ denotes the transition probability from node i to 
node j in a transition matrix $P_0$ where neither node i nor i+1 
is absorbing, and the column-vectors $a_1$, $b_1$, $a_2$, $b_2$ are defined 
as 

$$
\begin{aligned}
a_1 &= [\,\underbrace{0 \;\cdots\; 0}_{i-1} \;\; 1 \;\; 0 \;\; \underbrace{0 \;\cdots\; 0}_{n-i-1}\,]^T, &
b_1 &= [\,p_{i,1} \;\cdots\; p_{i,n}\,]^T, \\
a_2 &= [\,\underbrace{0 \;\cdots\; 0}_{i-1} \;\; 0 \;\; 1 \;\; \underbrace{0 \;\cdots\; 0}_{n-i-1}\,]^T, &
b_2 &= [\,p_{i+1,1} \;\cdots\; p_{i+1,n}\,]^T.
\end{aligned}
$$

By an argument similar to the one we made in the proof 
of Proposition 5, we can compute F' from F in the following two 
steps, each costing $O(n^2)$ operations for the provided 
parenthesization. The first step applies Lemma 1 to the update $+a_2 b_2^T$, 
the second to the update $-a_1 b_1^T$: 

$$Z = F - (F a_2)(b_2^T F) / (1 + b_2^T (F a_2)),$$ 

$$F' = Z + (Z a_1)(b_1^T Z) / (1 - b_1^T (Z a_1)).$$ 

We have thus shown that, given F, we can compute F', and 
therefore $ac_Q(C')$ as well, in time $O(n^2)$. ■ 
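
The rank-one updates of Propositions 5 and 6 can be checked numerically with a small sketch. It rests on several assumptions that go beyond the text: NumPy is used for the linear algebra, the walk is a simple random walk without restarts on a toy graph, nodes are indexed from 0, and absorbing nodes are modelled by zeroing their rows in the transition matrix (as the block structure in the proofs suggests). All function names are hypothetical.

```python
import numpy as np


def fundamental(P0, absorbing):
    """F = (I - P)^{-1}, where P is P0 with the rows of absorbing nodes zeroed."""
    P = P0.copy()
    P[sorted(absorbing), :] = 0.0
    return np.linalg.inv(np.eye(len(P)) - P)


def rank_one(F, a, b):
    """Sherman-Morrison: inverse of (F^{-1} + a b^T), given F, in O(n^2)."""
    Fa = F @ a
    return F - np.outer(Fa, b @ F) / (1.0 + b @ Fa)


def add_absorbing(F, P0, i):
    """Update F when node i becomes absorbing (Proposition 5)."""
    a = np.zeros(len(P0))
    a[i] = 1.0
    return rank_one(F, a, P0[i])


def swap_absorbing(F, P0, i, j):
    """Update F when absorbing node i is replaced by node j (Proposition 6)."""
    Z = add_absorbing(F, P0, j)   # intermediate state: both i and j absorb
    a = np.zeros(len(P0))
    a[i] = -1.0                   # subtracting a_1 b_1^T releases node i
    return rank_one(Z, a, P0[i])


# Toy graph: a 5-cycle with one chord; simple random walk, no restart (assumption).
n = 5
A = np.zeros((n, n))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]:
    A[u, v] = A[v, u] = 1.0
P0 = A / A.sum(axis=1, keepdims=True)

F1 = fundamental(P0, {0})
F2 = add_absorbing(F1, P0, 2)        # C = {0, 2}
F3 = swap_absorbing(F2, P0, 2, 3)    # C' = {0, 3}
s = np.full(n, 1.0 / n)
score = s @ F3 @ np.ones(n)          # s^T F 1, as written in the appendix
```

Each update touches F only through matrix-vector products and one outer product, which is where the $O(n^2)$ cost per candidate comes from.
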

Fig. 1: Results on small datasets for varying k and s = 2. 

Fig. 2: Results on large datasets for varying k and 